The representation of women in film is deeply flawed and biased. Although women make up about 49.6% of the global population (ourworldindata.org, 2017), a large share of films give them little to no meaningful representation.

One widely used measure of this representation is the Bechdel Test, which originated in a 1985 comic strip by Alison Bechdel.

According to this comic, to pass the Bechdel Test, the film must satisfy three criteria:

(1) The film has at least two women in it, (2) the women speak to each other at least once, and (3) their conversation is about something besides men.
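Because each criterion only makes sense once the previous one is met, the three criteria can be read as a score from 0 to 3. The sketch below assumes the convention used by the `rating` field of the Bechdeltest.com data analyzed later in this tutorial (0 = fewer than two women, 3 = all three criteria met). The function names and boolean parameters are hypothetical, introduced only for illustration.

```python
def bechdel_rating(has_two_women: bool, women_talk: bool, not_about_men: bool) -> int:
    """Return a 0-3 score: how many criteria are met, checked in order.

    Hypothetical helper; it assumes the 0-3 rating convention described
    in the paragraph above.
    """
    if not has_two_women:
        return 0  # criterion 1 fails
    if not women_talk:
        return 1  # criterion 1 met, criterion 2 fails
    if not not_about_men:
        return 2  # criteria 1 and 2 met, criterion 3 fails
    return 3      # all three criteria met


def passes_bechdel(rating: int) -> bool:
    """A film passes only when all three criteria are satisfied."""
    return rating == 3


print(passes_bechdel(bechdel_rating(True, True, False)))
```

Treating only a score of 3 as a pass matches the `rating == 3` filter used in the analysis cells.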

Throughout this tutorial, I will be investigating three questions:

(1) How many films pass the Bechdel Test per year? (2) Has this ratio improved over time? (3) How do films passing the Bechdel Test perform in comparison to those that don't?

Step 1 -- Data collection/curation + parsing

Bechdeltest.com provides an easy-to-use API that is updated regularly. The website provides four methods to query the list: getMovieByImdbId, getMoviesByTitle, getAllMovieIds, and getAllMovies. I used getAllMovies because I will be working with the entire dataset. The query returns JSON containing the following fields for each movie: year, rating, title, id, imdbid.
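For a quick spot check of a single film, one of the other methods can be queried instead of downloading the full list. The following is a minimal sketch, not the site's official client: the helper names are hypothetical, and the query parameter name `imdbid` is my assumption about how getMovieByImdbId expects its argument, so check it against the API documentation before relying on it.

```python
from urllib.parse import urlencode

# Base URL taken from the getAllMovies request used in this tutorial.
API_ROOT = "https://bechdeltest.com/api/v1"


def build_url(method: str, **params: str) -> str:
    """Build the URL for an API method with optional query parameters."""
    query = urlencode(params)
    return f"{API_ROOT}/{method}?{query}" if query else f"{API_ROOT}/{method}"


def get_movie_by_imdb_id(imdbid: str, timeout: float = 10.0) -> dict:
    """Fetch one movie record (hypothetical helper; performs a network call)."""
    # Imported here so build_url can be used without the requests dependency.
    import requests

    # Assumption: the method takes the IMDb id in a parameter named 'imdbid'.
    response = requests.get(build_url("getMovieByImdbId", imdbid=imdbid), timeout=timeout)
    response.raise_for_status()  # surface HTTP errors instead of parsing an error page
    return response.json()


# '0209144' is the imdbid listed for Memento in the dataframe output below.
print(build_url("getMovieByImdbId", imdbid="0209144"))
```

Separating URL construction from the request keeps the part that can be checked offline apart from the part that needs network access.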

The website also has links for adding new movies and for suggesting a re-rating of an existing movie.

In [51]:
import requests
import json
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

# Request data for all movies in the list.
response_API = requests.get('https://bechdeltest.com/api/v1/getAllMovies?')

# Access the text of the response.
data = response_API.text

# Parse the JSON text. Using a new name avoids shadowing the json module.
movies = json.loads(data)

# Convert the parsed records into a dataframe.
df = pd.DataFrame(movies)

# Keep only movies released in 2000 or later.
df_21 = df[df['year'] >= 2000]

print(df_21)
        imdbid  year                 title     id  rating
3586   0199753  2000            Red Planet     15       0
3587   0209144  2000               Memento     52       1
3588   0144084  2000       American Psycho     64       3
3589   0164052  2000            Hollow Man     78       3
3590   0183523  2000       Mission to Mars     90       1
...        ...   ...                   ...    ...     ...
9493  12412888  2022  Sonic the Hedgehog 2  10279       3
9494   8115900  2022         Bad Guys, The  10280       3
9495   8851148  2022      In Between , The  10290       1
9496   5108870  2022               Morbius  10292       1
9497   7657566  2022     Death on the Nile  10297       3

[5912 rows x 5 columns]

To address my third question, 'How do films passing the Bechdel Test perform in comparison to those that don't?', I also need a dataset containing financial figures for these movies. The-numbers.com proved to be a great resource for this. Through a web-based API, this service gives users access to a large dataset of financial data on films.

For this project, I was specifically interested in production budgets, domestic and international box office numbers, and popularity on streaming services.
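To make the plan for the third question concrete, here is a minimal sketch of how the Bechdel ratings could be joined to a financial table and compared by pass/fail group. Everything about the financial table is an assumption: the column names `production_budget` and `worldwide_gross`, the placeholder titles, and the numbers are invented for illustration and are not real figures or the actual The-numbers.com schema.

```python
import pandas as pd

# Placeholder Bechdel records (hypothetical titles, same columns as df_21).
bechdel = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "year": [2001, 2001, 2002],
    "rating": [3, 1, 3],
})

# Hypothetical financial table; column names and values are illustrative only.
finance = pd.DataFrame({
    "title": ["Film A", "Film B", "Film C"],
    "year": [2001, 2001, 2002],
    "production_budget": [10.0, 20.0, 5.0],
    "worldwide_gross": [30.0, 30.0, 25.0],
})

# Join on title and year; films missing from either table are dropped.
merged = bechdel.merge(finance, on=["title", "year"], how="inner")

# Label each film by test result and compute gross per unit of budget.
merged["result"] = merged["rating"].eq(3).map({True: "pass", False: "fail"})
merged["return_ratio"] = merged["worldwide_gross"] / merged["production_budget"]

# Median return ratio for passing vs. failing films.
summary = merged.groupby("result")["return_ratio"].median()
print(summary)
```

With real data, matching on title is fragile: the Bechdel list stores some titles in forms such as "Bad Guys, The", so titles would need normalizing first, or the join could use an IMDb id if the financial source provides one.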

Step 2 -- Data Management and Representation

In this step, I prepare and tidy the datasets so that they are easy to use for data analysis.

In order to answer my first question, 'How many films pass the Bechdel test per year?', I need to compute the ratios of passing to total films for each year.

In [50]:
# lists to store necessary components of entire dataframe
years = []
ratios = []
total = []
passed = []
ratios_df = pd.DataFrame()


# Computes ratio of films which passed Bechdel Test per year. 
# Also, stores years, ratios, total number of films, and number of films which passed Bechdel test (rating = 3)
# in respective lists. 
for name, group in df_21.groupby('year'):
    tot_pass = len(group[(group['rating']==3)])
    tot_len = len(group['rating'])
    years.append(name)
    ratios.append(tot_pass/tot_len)
    total.append(tot_len)
    passed.append(tot_pass)
    # print(str(name) +': '+str(tot_pass/tot_len))

# Combines data into one dataframe. 
ratios_df['Year'] = years
ratios_df['Passed Films (P)'] = passed
ratios_df['Total Films (T)'] = total
ratios_df['Ratio (P/T)'] = ratios

ratios_df.head()
Out[50]:
   Year  Passed Films (P)  Total Films (T)  Ratio (P/T)
0  2000                96              157     0.611465
1  2001               112              179     0.625698
2  2002               106              191     0.554974
3  2003               104              173     0.601156
4  2004               127              206     0.616505
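The same per-year table can also be built without an explicit loop. This is a sketch of an alternative, not a replacement for the cell above: it uses a small toy dataframe (made-up rows with the same `year` and `rating` columns as `df_21`) so that it runs on its own, and the output column names are my own choice.

```python
import pandas as pd

# Toy stand-in for df_21: only the two columns the computation needs.
sample = pd.DataFrame({
    "year": [2000, 2000, 2000, 2001, 2001],
    "rating": [0, 1, 3, 3, 3],
})

per_year = (
    sample.assign(passed=sample["rating"].eq(3))  # True where the film passes
    .groupby("year")["passed"]
    # sum of booleans = number passed, count = total films, mean = pass ratio
    .agg(passed="sum", total="count", ratio="mean")
    .reset_index()
)
print(per_year)
```

Taking the mean of a boolean column gives the share of True values directly, which is the P/T ratio computed in the loop.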

Step 3 -- Exploratory data analysis

In [95]:
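fig = plt.figure()
ax = fig.add_axes([0, 0, 1, 1])
# Draw the two series side by side. With a width of 0.4 and offsets of
# +/- 0.2, each year's pair of bars does not overlap the neighboring years.
width = 0.4
ax.bar(ratios_df['Year'] - width / 2, ratios_df['Passed Films (P)'], color = 'b', width = width)
ax.bar(ratios_df['Year'] + width / 2, ratios_df['Total Films (T)'], color = 'g', width = width)
ax.legend(labels=['# Films that passed Bechdel Test', 'Total # of films in dataset'])
ax.set_xlabel('Year')
ax.set_ylabel('Number of films')
ax.set_title('Number of Films that Passed Bechdel Test Compared to Total Number of Films Per Year')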
Out[95]:
Text(0.5, 1.0, 'Number of Films that Passed Bechdel Test Compared to Total Number of Films Per Year')
In [105]:
x = ratios_df['Year']
y = ratios_df['Ratio (P/T)']

#find line of best fit
m, b = np.polyfit(x, y, 1)

plt.scatter(x, y)

#add line of best fit to plot
plt.plot(x, m*x+b)

plt.title("# Passed Movies / Total # Movies over Time")
plt.xlabel("Year")
plt.ylabel("# Passed Movies / Total # Movies")

plt.show()

print("Average Ratio: " + str(ratios_df['Ratio (P/T)'].mean()))
print("Slope of Line of Best Fit: " + str(m))
Average Ratio: 0.6320283175607141
Slope of Line of Best Fit: 0.005413905408588168
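A positive slope suggests improvement, but the slope alone does not say how closely the points follow the line. As a rough reading, a slope of about 0.0054 per year over 2000 to 2022 corresponds to a rise of roughly 0.12 in the pass ratio. The sketch below shows one way to add a goodness-of-fit number using only NumPy. To stay self-contained it uses just the five rows printed by `ratios_df.head()` in Step 2, so its numbers are not the full-dataset result; substitute the full `Year` and `Ratio (P/T)` columns for a real answer.

```python
import numpy as np

# First five rows of ratios_df, as printed in Step 2 (not the full dataset).
years = np.array([2000, 2001, 2002, 2003, 2004])
ratios = np.array([0.611465, 0.625698, 0.554974, 0.601156, 0.616505])

# Least-squares line, same call as in the cell above.
slope, intercept = np.polyfit(years, ratios, 1)

# Pearson correlation between year and ratio; for a straight-line fit,
# its square is the share of variance in the ratio explained by the line.
r = np.corrcoef(years, ratios)[0, 1]
r_squared = r ** 2

print("Slope: " + str(slope))
print("R squared: " + str(r_squared))
```

An R squared near 0 would mean the year explains little of the variation in the ratio, in which case the slope should be interpreted with caution.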